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Abstract 

Background: The primary objectives of this paper are: 1.) to apply Statistical Learning Theory (SLT), specifically 
Partial Least Squares (PLS) and Kernelized PLS (K-PLS), to the universal "feature-rich/case-poor" (also known as "large 
p small n", or "high-dimension, low-sample size") microarray problem by eliminating those features (or probes) that 
do not contribute to the "best" chromosome bio-markers for lung cancer, and 2.) quantitatively measure and verify 
(by an independent means) the efficacy of this PLS process. A secondary objective is to integrate these significant 
improvements in diagnostic and prognostic biomedical applications into the clinical research arena. That is, to 
devise a framework for converting SLT results into direct, useful clinical information for patient care or 
pharmaceutical research. We, therefore, propose and preliminarily evaluate, a process whereby PLS, K-PLS, and 
Support Vector Machines (SVM) may be integrated with the accepted and well understood traditional biostatistical 
"gold standard", Cox Proportional Hazard model and Kaplan-Meier survival analysis methods. Specifically, this new 
combination will be illustrated with both PLS and Kaplan-Meier followed by PLS and Cox Hazard Ratios (CHR) and 
can be easily extended for both the K-PLS and SVM paradigms. Finally, these previously described processes are 
contained in the Fine Feature Selection (FFS) component of our overall feature reduction/evaluation process, which 
consists of the following components: 1.) coarse feature reduction, 2.) fine feature selection and 3.) classification (as 
described in this paper) and prediction. 

Results: Our results for PLS and K-PLS showed that these techniques, as part of our overall feature reduction 
process, performed well on noisy microarray data. The best performance was a good 0.794 Area Under a Receiver 
Operating Characteristic (ROC) Curve (AUC) for classification of recurrence prior to or after 36 months and a strong 
0.869 AUC for classification of recurrence prior to or after 60 months. Kaplan-Meier curves for the classification 
groups were clearly separated, with p-values below 4.5e-12 for both 36 and 60 months. CHRs were also good, with 
ratios of 2.846341 (36 months) and 3.996732 (60 months). 

Conclusions: SLT techniques such as PLS and K-PLS can effectively address difficult problems with analyzing 
biomedical data such as microarrays. The combinations with established biostatistical techniques demonstrated in 
this paper allow these methods to move from academic research and into clinical practice. 
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Introduction 

One of the most popular and challenging topics in 
bioinformatics research is gene selection from microar- 
ray data because it involves both statistical processing as 
well as biological interpretation. The statistical problems 
are daunting because of the large number of represented 
genes relative to the small number of samples. This pro- 
vides a prime opportunity to over-fit the data during the 
model building process. Biology is a significant compo- 
nent because identifying significant genes representative 
of a given clinical endpoint is a critical step toward 
understanding the biological process. Several conse- 
quences arise as a result of the statistical over-fitting 
problem. Very large Receiver Operating Characteristic 
(ROC) Area Under the Curve (AUC) values can be 
achieved on both training and validation data sets, but 
the results provided by these trained Complex Adaptive 
Systems (CAS) frequently fail to generalize to data sets 
other than training and validation sets. Furthermore, 
these CAS system designs do not necessarily operate on 
similar data sets with larger representative samples. Dif- 
ferent CAS solutions may produce different gene sets 
from the same set of microarray data. Consequently, any 
CAS should first attempt to achieve some sort of gener- 
alization ability. Secondly, because of the over-fitting 
problem described above, each proposed feature (or 
gene) reduction CAS generally is based on a unique the- 
oretical analysis, which means that how these separate 
CAS are connected is not well understood. Conse- 
quently, this difficulty results in the same problem sta- 
ted above: different algorithms will generate different 
prognostic gene sets using the same microarray data. 
This means that developing an underlying theory for 
feature selection would help to understand these algo- 
rithms as well as classify which of these are the "most" 
useful for gene selection. Song [1] presents a BAHSIC 
algorithm which claims to address this unifying algo- 
rithm principle proposal. BAHSIC defines a class of 
backward (BA) elimination feature selection algorithms 
that uses kernels and the Hilbert-Schmidt Independence 
Criterion (HSIC) [2]. Song demonstrates that the BAH- 
SIC algorithm encompasses the following well-known 
feature selection algorithms: (1) Pearson's correlation 
coefficient [3,4], (2) £-test [5], (3) signal-to-noise ratio 
[6], (4) Centroid [7,8], (5) Shrunken Certroid [9,10], and 
finally, (6) ridge regression [11]. These collective results 
suggest that the Evolutionary Programming driven Sup- 
port Vector Machine (EP-SVM) [12,13] with a choice of 
similarity, sum and product kernels might be a good 
wrapper/classification candidate for gene selection. This 
paper adapts a method, summarized in the methods sec- 
tion, originally developed for the social sciences and 
subsequently adapted to chemometrics, called Partial 



Least Squares (PLS) to this "feature-rich/case-poor" 
environment, as subsequently described, by theoretically 
attempting to eliminate those features which do not 
contribute to the "best" chromosome marker for lung 
cancer. 

Background of lung cancer 

Lung cancer is the leading cause of death in cancer 
patients worldwide. The American Cancer Society pre- 
dicts that 156,940 people will fall victim to the disease in 
2011, accounting for 27% of all cancer deaths [14]. The 
5-year survival rate of lung cancer patients is 16% pri- 
marily due to late stage diagnosis. Of the 221,130 esti- 
mated cases that will be diagnosed in 2011, 85% will have 
late stage tumors (stages II, III, IV) that have begun to 
advance. For these patients, treatment often includes sur- 
gical resection of tumors where possible, post-operative 
radiation and adjuvant chemotherapy. The 5-year survi- 
val rate for early stage (stage I), non-small cell lung can- 
cer (NSCLC) patients is 53% [14] and treatment often 
only includes surgical resection [15]. However, 35%-50% 
of these patients will suffer a relapse of the disease within 
5 years of surgery [16]. Post-operative chemotherapy can, 
in most cases, improve survival in early stage cancer 
patients. But its use is controversial. Doctors currently 
lack a validated and clinically accepted method to predict 
which patients are at a high risk of recurring cancer [17]. 
Those patients that are at a high risk of recurrence might 
benefit from post-operative adjuvant chemotherapy, 
whereas those patients that are at a low risk can be 
spared the side effects of chemotherapy [18]. 

Data set description and modifications 

The experiments designed used the gene expression 
profiles of 442 lung adenocarcinomas compiled by 
Shedded et al. [19]. These samples were compiled from 
six institutions and originally handled by a consortium 
that included: the University of Michigan (177 samples), 
the H. Lee Moffit Cancer Center (79 samples), the 
Dana-Farber Cancer Institute (82 samples) and the 
Memorial Sloan-Kettering Cancer Center (104 samples). 

It is important to note that in Dobbin et al. [20] these 
samples were shown to be comparable because the 
variability in gene expression values can be attributed 
more to the biology of the samples than to the institu- 
tion effect. As a result, the data can be combined for 
the purposes of our analysis despite being processed at 
different institutions. Furthermore, none of the patients 
in the study received pre-operative chemotherapy or 
radiation and at least two years of follow up information 
was available. Tumor samples were required to contain 
a minimum of 60% tumor cellularity for inclusion in the 
study with most containing 70%-90%. 
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Gene expression profiles of all samples were quanti- 
fied using the Affymetrix Human Genome-U133A Gen- 
eChip. The resulting CEL files generated at each of the 
four institutions were quantile normalized using the 
NCI_U133A_61L array as a reference. Final expression 
values were calculated using the DChip software (Build 
version February 2006) using the default settings. Each 
sample is characterized by 22,283 probes/genes (also 
referred to as features in this paper) as well as a host of 
clinical covariates including age, gender, and T/N cancer 
stage. A few minor discrepancies were found in the 
probe data obtained from the caArray website. First, 
probe 207140 at contained expression values of "NA" 
for all patients in the study. To mitigate this problem, 
the data corresponding to this probe were removed 
prior to our analysis. Secondly, patients Moff 18351, 
Moff 2362A and Moff 3009D did not have expression 
values for the 222086_s_at probe. In lieu of removing 
this probe entirely, the data for these patients were 
assigned an expression value equal to the mean 
(18.37114) of that probe's expression values across all 
other patients. The CEL files, DChip normalized expres- 
sion values and clinical information for all patients 
involved in this study are available through the caArray 
website https://array.nci.nih.gov/caarray/project/details. 
action?project.id=182. Other work with this data set is 
described in [18] and [19]. 

Experimental design for 3 and 5 Years 

To address the clinical issue of determining risk of 
recurrence delineated above, two classification experi- 
ments were designed. The first experiment classified 
NSCLC patients as "high risk" if cancer were likely to 
recur within 3 years of surgery and "low risk" otherwise. 
The 3 year cut-off was chosen because the majority of 
patients that do relapse will do so within the first 3 
years [16]. The second experiment objective was to clas- 
sify patients as "high risk" if cancer were likely to recur 
within 5 years of surgery, and "low risk" otherwise. This 
cut-off was chosen because, in current clinical practice, 
a patient that does not recur cancer within 5 years is 
often considered "cancer free" due to the low chance of 
recurrence after that time [21]. 

The first experiment contained 295 patients obtained 
from the Shedden et al [19] data set. This subset con- 
tained all patients for which cancer recurred as well as 
those that survived beyond 3 years (without recurrence). 
For purposes of training and validating the algorithms 
discussed in this paper, patients for which cancer 
recurred within 3 years (of surgery) are considered 
recurrent (high risk) and those that had recurrent can- 
cer, or survived without recurrence beyond 3 years, are 
considered non-recurrent (low risk). 



The second experiment was composed of 257 patients, 
which were also obtained from the Shedden et al [19] 
data set. All recurrent patients, as well as those that sur- 
vived without recurrence beyond 5 years, were included 
in this analysis. Patients for which cancer recurred 
within 5 years of surgery were considered recurrent 
(high risk). Those that recurred, or survived without 
recurrence beyond 5 years, were considered non-recur- 
rent (low risk). Finally, it also should be noted that 
Shedden et al [19] showed that the use of the clinical 
covariates age, gender and tumor stage during analysis 
improved the performance of most classifiers. Conse- 
quently, it was required that all patients included in the 
two experiments described above have available clinical 
information pertaining to the features: age, gender and 
tumor stage. 

Methods 

Overview of Feature Reduction/Classification Process 

Microarray data sets have a significant feature-rich/case- 
poor problem which can lead to over-fitting (i.e. models 
that produce excellent results on the training data exist, 
but none of which may be valid and have good perfor- 
mance on the test data) unless the number of features 
are significantly reduced prior to the generation of any 
classification or prediction model. The objective of this 
three-step process is to identify those significant features 
which are most useful in producing an accurate classifi- 
cation or prediction model. The process of feature 
reduction/classification is depicted in Figure 1, and con- 
sists of a Coarse Feature Reduction (CFR) process, fol- 
lowed by a Fine Feature Selection (FFS) process and 
then classification. 
Coarse Feature Reduction 

The automated CFR employees a simple two sample t- 
test followed by variance pruning (cut-off based on coef- 
ficient of variation). It is a simple process to remove lot 
of probes that are not useful for classification, Le., those 
not considered statistically significant to classification. 
See [22-24] for details on variance pruning. 
Partial Least Squares 

This section contains a brief, heuristic overview of Par- 
tial Least Squares (PLS). PLS is an extension of least 
squares regression (LSR). In LSR, the response variable 
y is predicted from p coordinates and n observations, 
denoted by X = { } , where each x\ e dt p . PLS 

finds "new variables" through the construction of speci- 
fic combinations of the original coordinates. These 
"latent variables" explain both the y response as well as 
the covariate space and are denoted by the following 
expressions: 

X = tipi + t 2 p 2 + - + t s p s + £ (1) 
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Figure 1 Feature (probe) reduction process. Description of 
feature reduction process 



y = hqi + t 2 q 2 + ... + t s q s + f (2) 
where: 

♦ t s = the s th latent variable (or conjugate vector; a n 
by 1 column vector). Generally most of the variabil- 
ity is characterized by M latent variables with a max- 
imum of M - 5 required for most problems. 

♦ p s and q s = the s th weight vectors (p s is a 1 by p 
row vector, q s is scalar). 

♦ e, £ = small errors in the remaining parts not 
explained by the latent variables. 

For this microarray data set, we began with 271 fea- 
tures after CFR and reduced this set to a minimum of 1 
latent variable and a maximum of 5 latent variables (see 
Results section). Therefore, the principle advantage of 



PLS for a problem of this type is its ability to handle a 
very large number of features: a fundamental problem of 
a feature-rich/case-poor data set. PLS then performs a 
least-squares fit (LSF) onto these latent variables, where 
this LSF is a linear combination that is highly correlated 
with the desired y response while, at the same time, 
accounting for the feature space variability. A summary 
of the features and advantages of PLS follows: 

♦ PLS algorithms are very resistant to over-fitting, 
when compared to LSR, and are fast and reasonably 
easy to implement. 

♦ For most problems with few data points and high 
dimensionality where PLS excels, a least squares 
solution may not be possible due to the singularity 
problem. 

♦ PLS regression maps the original data into a lower- 
dimensional space using a W projection matrix and 
computes a least squares solution in this space. See 
the algorithm below for the definition of W. 

♦ What makes PLS especially interesting for biome- 
dical and data mining applications is its extension 
using kernels, which leads to kernelized PLS (K- 
PLS), similar to the treatment in SVM. 

♦ PLS may be considered a better principal compo- 
nent analysis (PCA). 

- The first key difference from PCA is that PLS 
computes an orthogonal factorization of the 
input vector X and response y (note: y can also 
be a vector) in the process of computing the pro- 
jection matrix W. 

- The second key difference from PCA is that the 
least squares model for K-PLS is based on 
approximation of the input and response data, 
not the original data. 

- PLS and PCA use different mathematical mod- 
els to compute the final regression coefficients. 
Specifically, the difference between PCA and PLS 
is that a new set of basis vectors (similar to the 
eigenvectors of X T X in PCA) is not a set of suc- 
cession of orthogonal directions that explain the 
largest variance in data, but rather are a set of 
conjugate gradient vectors in the correlation 
matrices which span a Krylov space. 

An algorithm of PLS paradigm follows: 

1. Let: X 1 = X,yi = y 

2. For m = 1 to M, where M = the desired number of 
latent variables, do: 

(a) Compute direction of maximum variance w m 

(b) Project X onto w t m = X m w m 

(c) Normalize t t m = t m l\t m \ 
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(d) Deflate X X m+1 = X m ^t rn {t m ) 1 X m 

(e) Deflate y y m+1 = y m -t m {t m ) T y m 

(f) Normalize Rafter deflation y m+1 = y m +\l\y m +\\ 

3. Finally, compute the regression coefficients using 
latent variables: p = W{T T XW)' l T T y 

where: 

♦ w m is the m th column vector of W 

♦ t m is the m th column vector of T 

♦ X m and y m are the input matrix and response vec- 
tor that are being deflated, and /3 is the linear regres- 
sion coefficient vector. A geometric representation of 
part of the algorithm and the insight of deflation can 
be seen in Figure 2. 

Kernelized Partial Least Squares 

Non-linear relationships between variables may be found 
by embedding this data into a kernel induced feature 
space. See [25] for a good reference of kernel learning. 
This kernel "trick" is used in PLS and is called K-PLS. 
Consider now a mapping q>, which maps any data vector 
from the sample space to some other (possibly infinite 
dimensional) Euclidean space % (feature space): 

0 : m n -> U (3) 

The mapping will "recode" the data set as: 



{(0Oi),yi), (0(*2),y2)/ • • • i&M'Yn)} 



(4) 



This mapping of the data set is from non-linear input 
space to a linear feature space. That is, although the 
environment data representation in the input X space is 
non-linear, after the data are processed by the (p 



mapping, the data characterized by this mapping is lin- 
ear in %, with the happy result that linear techniques 
may be used on the mapped data while preserving the 
non-linear properties represented in the input space. 
This mapping is accomplished, as previously stated, by 
using a valid kernel function. 

Adding this kernel-induced capability to the PLS 
approach means that a real time, non-linear optimal 
training method now exists which can be used to per- 
form computer aided diagnosis. A second advantage of 
this approach is that a kernel function K(xi,x 2 ) com- 
putes the inner products ((p{x 1 ) ) (p{x 2 )) in the feature 
space % directly from the samples X\ and x 2 > without 
having to explicitly perform the mapping, making the 
technique computationally efficient. This is especially 
useful for algorithms that only depend on the inner pro- 
duct of the sample vectors, such as SVM and PLS. 

Computationally, kernel mappings have the following 
important properties: (1) they enable access to excep- 
tionally high (even infinitely) dimensional and, conse- 
quently, very flexible feature space, with a 
correspondingly low time and space computational cost, 
(2) they solve the convex optimization problem without 
becoming "trapped" in local minimal and, more impor- 
tantly, (3) the approach decouples the design of the 
algorithms from the specifications of the feature space. 
Therefore, both learning algorithms and specific kernel 
designs are not as difficult to analyze. 

The algorithm used to develop the K-PLS model, is 
given below. Details can be found in [26]. 

1. Let K Q = (Kij = (pfo), (p(xj)) = K(x b xj)) {ij = Um) be the n 
by n Gram matrix induced by K, the selected kernel func- 
tion corresponding to (p(<). Let Ki be the centered form of 



What is Deflation? 



Original Data 




Line L spanned by iV" vector 



Subtract projection on L to deflate 



Projection onto L 



Figure 2 Deflation. The geometric interpretation of the 'deflation' step in the PLS Algorithm. This 'Deflation' effectively removes one dimension 
by projecting the data onto a subspace that is one dimension less than the row number of the current data matrix, and orthogonal to the 
vector W 71 . 
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K 0 ,y be the be the response variable, normalized to have a 
mean of 0 and a standard deviation of 1, and M be the 
desired number of latent variables. 

2. For m = 1 to M, do: 

(a) t m = K m ' y m 

( b ) tm = 

(c) K m+1 = (J — t m t^)K m (I - t m t T m ) 

(d) y m+ i =yj- 1 (tmtl l )y m 

(e) = — " 

llym+lll 

3. Finally, compute the regression coefficients using: 
a = Y(T T KlY)- l T T Y where: 

♦ / is an m x m identity matrix 

♦ K m is the Gram matrix 

♦ t m and y m are the columns of T and F 
respectively 

4. The regression equation then becomes: 

n 

i=i 

Note x is any sample from the testing data to be pre- 
dicted and Ki(Xi,x) is element from the centered form of 
the training/testing kernel matrix. 
Evolutionary Programming derived K-PLS machines 
The particular K-PLS kernel types and kernel para- 
meters were derived using an evolutionary process 
based on the work of Fogel [27] called Evolutionary Pro- 
gramming (EP). EP is a stochastic process in which a 
population of candidate solutions is evolved to match 
the complexity and structure of the problem space. 

This process iteratively generates, valuates, and selects 
candidates to produce a near-optimal solution without 
using gradient information, and is therefore well suited 
to the task of simultaneously generating both the K-PLS 
model architecture (kernel) and parameters. Figure 3 
and found in more detaA description of this process is 
shown in Figuil below. 

1. Initial K-PLS parameter population created: A 
population of candidate solutions (K-PLS kernel 
architectures and parameters) is randomly generated. 

2. Mutation of K-PLS machines: Each of these candi- 
date solutions then is copied and mutated, yielding a 
solution pool of twice the original size, using the 
equation given below: 

v \ = Vi e VV^ W 



where m is the total number of configurable 
parameters being evolved, N{0,1) is a standard 
normal random variable sampled once for all m 
parameters of the v vector, and iV/(0,l) is a stan- 
dard normal random variable sampled for each 
of the m parameters in the v vector. 
The second step of this mutation process com- 
prises the updating of each configurable para- 
meter for all elements of the evolving population. 
If we let the vector y t denote these elements for 
each of the individual member of the population, 
this update process will be accomplished as fol- 
lows: 

yf = n + Cv\ (7) 



Here C is a standard Cauchy random variable. It 
is used because it has longer tails and offers bet- 
ter mutation performance. 
3. Selection of K-PLS machines: All elements of this 
pool are scored using an objective function. These 
objective function scores are then used to order the 
candidate solutions from the "most fit" to "least fit." 
Better results usually are obtained from using tour- 
nament selection methodologies. With tournament 
selection, each candidate solution competes against a 
random subset of the remaining solutions. Finally, 
the upper 50% of the solution pool is selected to 
continue as the basis for the next generation and the 
remaining 50% are "killed off (discarded) to reduce 
the pool to the original population size. This process 
is generally repeated for a specified number of gen- 
erations, unless some other "stopping" criteria is 
used. 

For more details on the EP process, refer to our pre- 
vious work [28]. 

Support Vector Machine and its capacity to reach the 
global optimum 

The K-PLS results were validated by using another ker- 
nel-based Statistical Learning Theory model called a 
Kernelized Support Vector Machine (K-SVM). SVMs 
was developed by Vapnik [29-31]. Tutorials to SVM can 
be found in [32] and [25]. 

The discussion below provides the theoretical explana- 
tion for why SVMs can always be trained to a global 
minimum, and thereby should provide better diagnostic 
accuracy, when compared with neural network perfor- 
mance trained by back propagation. 

Assume there exist n experimentally derived observa- 
tions. Each observation (or training example) consists of 
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Figure 3 Evolutionary Programming. Shown is the process of the Evolutionary Programming optimization technique utilized to find the 

optimal kernel parameters. The process creates an initial population of candidate solutions (chromosomes) which undergo a stochastic search 

for the optimal parameter through the sub-processes of mutation and tournament selection of the 'most-fit' genes. 
^ J 



a vector x t containing the input pattern and a corre- 
sponding known classification y t . The objective of the 
learning machine is to formulate a mapping x t — > y t . 
Now consider a set of functions f(x,a) with adjustable 
parameters a, that defines a set of possible mappings x 
— » f{x>a). Here, x is given and a is chosen. In the case 
of a traditional neural network of fixed architecture, the 
a values would correspond to the weights and biases. 
The quantity R(a), known as the expected risk, asso- 
ciated with learning machines is defined as: 

Hot) = j ^\y-f{x,a)\p{x,y)dxdy (8) 



where, p(x, y) is an unknown probability density func- 
tion from which the examples were drawn. This risk 
function is the expected value of the test (or validation) 
error for a trained learning machine. It may be shown 
that the best possible generalization ability of a learning 
machine is achieved by minimizing R{a), the expected 
risk. There exists a error bound of generalization, for 
binary classification, which holds with the probability of 
at least I-77, 0<7]<1 for all approximating functions 
that minimize the expected risk. 
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The first term on the right hand side, R emp {oc), is 
known as the "empirical risk", expressed by: 

1 n 

Remp{a) = J^J2 ^ °0 1 ( 10 ) 

i=i 

Empirical risk is a measure of the error rate for the 
training set for a fixed, finite number of observations. 
This value is fixed for a particular choice of a and a 
given training set {(x if yi) t i - 1,2, — n}. The second term 
in (9) is the "Vapnik-Chervonenkis (VC) confidence 
interval" This term is a function of the number of train- 
ing samples n, the probability value 77 and the VC 
dimension h. The VC dimension is the maximum num- 
ber of training samples that can be learned by a learning 
machine without error for all possible labeling of the 
classification functions f(x,a), and is, therefore, a mea- 
sure of the capacity of the learning machine. In tradi- 
tional neural network implementations, the confidence 
interval is fixed by choosing a network architecture a 
priori. Neural network training by back-propagation 
minimizes the empirical risk only. 

In contrast to neural network, in a SVM design and 
implementation, not only is the empirical risk mini- 
mized, the VC confidence interval is also minimized by 
using the principles of structural risk minimization 
(SRM). Therefore, SVM implementations simultaneously 
minimize the empirical risk as well as the risk associated 
with the VC confidence interval, as defined in the above 
expression. The bound in (9) also shows that as n — > 00, 
the empirical risk approaches the true risk because the 
VC confidence interval risk approaches zero. The reader 
may recall that obtaining larger and larger sets of valid 
training data would sometimes produce (with a great 
deal of training experience) a better performing neural 
network using classical training methods. This restric- 
tion is not incumbent on the SRM principle and is the 
fundamental difference between training neural net- 
works and training SVMs. Finally, because SVMs mini- 
mize the expected risk, they provide a global minimum. 
Measures of similarity for classification provided by various 
kernels 

Understanding what similarity as applied to K-PLS and 
K-SVM often provides additional insight in proper ker- 
nel selection. Therefore, we now consider kernel func- 
tions and their application to K-PLS and K-SVMs. K- 
PLS and K-SVM solutions in non-linear, non-separable 
learning environments utilize kernel based learning 
methods. Consequently, it is important to understand 
the practical implications of using these kernels. Kernel 
based learning methods are those methods which use a 
kernel as a non-linear similarity to perform compari- 
sons. That is, these kernel mappings are used to con- 
struct a decision surface that is non-linear in the input 



space, but has a linear image in the feature space. To be 
a valid mapping, these inner product kernels must be 
symmetric and also satisfy Mercer's theorem [33]. The 
concepts described here are not limited to K-PLS and 
K-SVMs, and the general principles also apply to other 
kernel based classifiers as well. 

A kernel function should yield a higher output from 
input vectors which are very similar than from input 
vectors which are less similar. An ideal kernel would 
provide an exact mapping from the input space to a fea- 
ture space which was a precise, separable model of the 
two input classes; however, such a model is usually 
unobtainable, particularly for complex, real-world pro- 
blems, and those problems in which the input vector 
provided contains only a subset of the information con- 
tent needed to make the classes completely separable. 
As such, a number of statistically-based kernel functions 
have been developed, each providing a mapping into a 
generic feature space that provides a reasonable approxi- 
mation to the true feature space for a wide variety of 
problem domains. The kernel function that best repre- 
sents the true similarity between the input vectors will 
yield the best results, and kernel functions that poorly 
discriminate between similar and dissimilar input vec- 
tors will yield poor results. As such, intelligent kernel 
selection requires at least a basic understanding of the 
source data and the ways different kernels will interpret 
that data. 

Some of the more popular kernel functions are the 
(linear) dot product (11), the polynomial kernel (12), the 
Gaussian Radial Basis Function (GRBF) (13), and the 
Exponential Radial Basis Function (ERBF) (14), which 
will be discussed below. 

The dot and polynomial kernels are given by, 

K(u,v) = u-v= ||2||||v||cos(0), (11) 

andK(2,y) = (u-v + \) d , (12) 

respectively, both use the dot product (and therefore 
the angle between the vectors) to express similarity; 
however, the input vectors to the polynomial kernel 
must be normalized (Le„ unit vectors). This restricts the 
range of the dot product in (12) to ±1, yielding kernel 
outputs between 0 and 2 d , where d is the degree of the 
polynomial. The implication of the dot product kernel 
having a positive and negative range (versus the strictly 
non-negative polynomial kernel) is that the classification 
process can learn from the unknown vector's dissimilar- 
ity to a known sample, rather than just its similarity. 
While the dot product kernel will give relatively equal 
consideration to similar and dissimilar input vectors, the 
polynomial kernel will give exponentially greater consid- 
eration to those cases which are very similar than those 
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that are orthogonal or dissimilar. The value of d deter- 
mines the relative importance given to the more similar 
cases, with higher values implying a greater importance. 
Measures of similarity for these two kernels are depicted 
in Figures 4 and 5. 

The Gaussian and Exponential RBF kernels are given 
by: 



and K(u, v) = e~ 2<r* ' U4j 
respectively. 

The Gaussian and Exponential RBF kernels use the 
Euclidean distance between the two input vectors as a 
measure of similarity instead of the angle between them 
(see Figures 6 and 7). 

Since ||u - v|| is always non-negative, both kernels 
achieve a maximum output of one when | |u - v \ \ = 0, 
and approach zero as \ \u - v|| increases. This approach 
is made faster or slower by smaller or larger values for 
O", respectively. Figure 6 shows the output of the GRBF 
kernel as a function of the distance between the input 
vectors for several different values of O". Figure 7 shows 
the output of the ERBF kernel. 

It is clear from Figures 6 and 7 that the distance at 
which the kernel output reaches approximately zero var- 
ies with a, and therefore the choice of a for this kernel 



is essential in properly distinguishing the level of simi- 
larity between two input vectors. If the value of a is too 
small-that is, most pairs of vectors are far enough apart 
that the kernel output is near zero, the SVM will have 
too little information to make an accurate classification. 
If the value of a is too large, so that even very distant 
pairs of input vectors produce a moderate output, the 
decision surface will be overly smooth. This may mask 
smaller distinctive characteristics which exist in the 
ideal decision surface, and will also increase the effect 
outliers in the training data have on the classification of 
an unknown point. 

Using PLS, KPLS, and SVM in clinical research 

While the methods covered in this paper offer statisti- 
cally significant improvements in diagnostic and prog- 
nostic biomedical applications, there has been great 
difficulty in utilizing advances such as these in clinical 
research. The statistics used to evaluate the performance 
of these techniques are not readily converted into direct 
clinical information that may help in patient care or 
pharmaceutical research. In order to address this, we 
have devised a framework to combine these techniques 
with well accepted and understood traditional biostatis- 
tics methods, the Cox Proportional Hazard model and 
the Kaplan-Meier (K-M) Curve. These two techniques 
each help address the question of how important a par- 
ticular parameter is to evaluating risk/survival. The fol- 
lowing subsections will give a basic overview of how 
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Cox and K-M can be combined with our techniques. 
For simplicity, such a combination will only be 
described with PLS, though it could just as easily be 
done with KPLS or SVM. 



PLS and Kaplan-Meier curves 

Developed in the 1950s, the K-M curve is the gold stan- 
dard in survival analysis [34]. In a normal survival curve, 
the number of survivors at a particular moment in time 
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is divided by the total number of patients. These points 
are plotted against time to give a curve which starts at 1 
and slowly curves downward until at some time when it 
reaches 0. A K-M curve introduces an additional ele- 
ment, the ability to utilize censored data. Censored data 
is partial data; where a final survival time is unknown 
but a minimum survival time is known. This can happen 
when patients die of unrelated causes, patient data is 
lost, or patients no longer keep contact with the 
researcher. To handle this censored data, when the par- 
tial survival time is reached, the patients are removed 
from the number of survivors and the total number of 
patients. These removals are marked on a K-M curve 
with a cross. The best way to utilize a K-M curve is to 
create different curves for different groups and compare 
them. For instance, a K-M curve for men and one for 
women would be far apart if a particular condition was 
much more fatal in one gender than the other. Using 
this concept with our techniques, we can use the PLS 
(or KPLS/SVM) to split a data set of patients into good 
and poor prognosis categories. This can be done by first 
splitting training data around some cut-off survival time 
(survival being the lack of recurrence), such as survival 
before or after 36-months, and training the system to 
make predictions on a validation set. K-M curves can 
then be made for those predicted groups, and if the dif- 
ference between them is significant, then the system is 
performing well. A chi-square test is the standard for 
comparing curves, and a ^-value derived from that test 
of below .05 would indicate statistically significant 



difference between the two prognosis categories pre- 
dicted by PLS. 
PLS and Cox Hazard Ratios 

Another common survival analysis technique is the Cox 
Proportional Hazard model [35]. The Cox model is a 
semi-parametric linear regression model which assumes 
that the hazard of an observation is proportional to an 
unknown "baseline" hazard common to all observations. 
Proportionality to this baseline, it is modeled as an 
exponential of a linear function of the covariates. From 
this model, a single Cox Hazard Ratio (CHR) value is 
derived which represents the "risk" of an event occur- 
ring associated with being in a particular group. The lar- 
ger the CHR, the greater the risk over time of the event 
occurring for one group than the other. Similar to the 
K-M curve, the PLS can separate patients into two prog- 
nosis categories, and the CHR will be a measure of the 
effectiveness of that categorization. A large ratio would 
indicate that the output of this method was a useful 
prognostic prediction for a patient to have recurrence. 

Results 

The goal of the experiments discussed herein were to 
derive models from the microarray data to classify each 
sample as belonging to either the class of recurrent or 
non-recurrent patients. The class of non-recurrent sam- 
ples are those samples belonging to patients which, after 
being treated did not recur cancer before the given cut- 
off period. Patients that did recur cancer before the cut- 
off period are considered to belong to the recurrent 
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class. Two separate experiments were performed with 
cut-off periods of 36 and 60 months respectively. 

As mentioned in the Methods section, the data were 
pre-processed using CFR, followed by FFS, and finally 
classification model building and evaluation. 

Coarse Feature Reduction 

For the 36 month classification experiment, CFR was 
used to reduce the original number of features 
(probes) from 22,282 to 2,675 using a hard cut-off t- 
test p-vdlue of 0.05. Then, this probe count was further 
reduced to 594 using a coefficient of variation cut-off 
of 0.632. In like manner, using CFR for the 60 month 
classification experiment, the number of probes was 
reduced from 22,282 to 829 using the same hard cut- 
off £-test ^-value of 0.05. This number was then 
further reduced to 212 using coefficient of variation 
cut-off of 0.641. After reducing the initial feature set 
using the CFR technique, the process of FFS and clas- 
sification was performed. 

Fine Feature Selection/Classification 

Fine Feature Selection using Partial Least Squares 

In this section, we use the AUC value as the fitness 
metric to evaluate the relative worth of the classification 
model. Higher AUC values are indicative of better clas- 
sifiers, with an AUC value of 1.0 indicating a perfect 
classifier, which is arguably impossible for any non-tri- 
vial classification task. 

The FFS process utilizes the weight vector of the first 
latent variable generated by the Linear PLS (L-PLS) 
algorithm to ascertain feature importance. The most 
important features (those with the largest corresponding 
weight vector components) are ranked highest and fea- 
tures with lower corresponding components are dis- 
carded. This step, called Fine Feature Selection, provides 
a ranking of importance, which means the magnitude of 
each feature's respective component is directly corre- 
lated with its predictive power in the model. 

The FFS process builds this "importance metric" by 
iterating the analysis of the weight vectors of randomly 
assigned training folds 10,000 times employing three 
sensitivity settings, where these three sensitivities score 
the top 20, 30, and 150 most influential performers for 
each of their respective 10,000 runs, based on each fea- 
ture's weight in the weight vector of L-PLS. For exam- 
ple, if 'Age' has the largest component and 'Sex' has the 
second largest in the top 30 sensitivity setting, the score 
for 'Age' would be 30 and that for 'Sex' would be 29. 
For each run time, the data is split randomly into train- 
ing and validation folds. These data are normalized then 
analyzed using Linear PLS and the weight vector is 
extracted, sorted, and 'winning' features have their 
scores updated by position. 



In each of the three settings, a number, p, of features 
are retained based on theirs aggregated score over 
10000 runs. The number to retain, p, is user-specified. 
In FFS, we tried several p values, with increments of 50 
features, beginning at 50 and ending at 550. By gradually 
increasing the size of feature retention, one can empiri- 
cally optimize the number of features for classification/ 
prediction. Lastly, a 'global (most important) feature set 
(S FFS y is created, which is the union of the retained fea- 
ture sets from all three settings. These S FFS features are 
the final product of the FFS process and the only ones 
included in the construction of the refined input data 
matrix, X FFS . In summary, S FFS is given by: 

S P FFS = S p 20 S p 30 S p 150 (15) 

where S P FFS = the union set of all three top performing 
feature subsets, S p - each setting's top performers, / = 
20, 30, 150, and p = the number of features retained in 
each setting. Note that the number of features in S P FFS 
may not be exactly 3p or p. 

In our study, we have selected 361 and 102 features 
using this FFS process for the 36- and 60-month experi- 
ment respectively, from the 594 and 212 features that 
were selected by CFR. 
Comparisons using PLS classification 

As noted, we compared four separate models' perfor- 
mances based on different data: L-PLS and K-PLS Poly- 
nomial Kernel (KPLS-Poly) based on the Coarse Feature 
Reduced (CFR) data, and on the Fine Feature Selected 
(FFS) data respectively (the FFS-data is actually pro- 
cessed by both CFR and FFS). 

♦ We sought out to determine which model pro- 
duced the most accurate prediction of recurrence. 

♦ We also sought to determine whether the data was 
linear or non-linear, which was determined by which 
class of model yielded better results: L-PLS or K- 
PLS with non-linear kernels. 

♦ Finally, we sought to determine the effectiveness of 
our PLS weight vector-based Fine Feature Selection 
method. This was determined by the comparison 
between the validation AUC values for the same 
models on the CFR-data and the FFS-data. If the 
results on the FFS-data are better than the CFR- 
data, then FFS is effective. 

What we found was that both the 36-month and 60- 
month data sets were inherently linear in nature, mean- 
ing the L-PLS gave better AUC values on validation 
folds. These results can be seen in Table 1. This is a 
particularly surprising find, considering most real world 
phenomena are non-linear by nature. Yet, this was veri- 
fied by our K-PLS Evolutionary Programming 
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Table 1 Model Comparison The comparison of optimal performance values and number of latent variables for three 
independent models on the 36- and 60-month data 

(CFR-data) Model Top Validation AUC Value (36 mo/60 mo) Number of Latent Variables (30 mo/60 mo) 

L-PLS .791 /.831 3/2 

KPLS-Poly (Degree = 1) .784/.830 3/1 

SVM .78/- 

The best performance was seen with the L-PLS, out-competing the non-linear SVM and KPLS techniques in AUC performance. The number of latent variables 
required for the PLS-based techniques was no more than three for both data sets. 



optimization technique which selected polynomial ker- 
nel parameter of degree 1 as the best performer (K-PLS 
with polynomial kernel whose degree equals 1 is equiva- 
lent to L-PLS.) The validation AUC values, as we will 
show, were near equivalent for L-PLS and the KPLS- 
Poly of degree 1. As a second verification of our results, 
a Support Vector Machine (Lib-SVM) [36] analysis on 
the same data supported our findings by producing the 
same validation AUC values with the polynomial kernel 
The LibSVM analysis also supported our pre-analysis 
which showed extremely poor performance of the Gaus- 
sian/Exponential RBF kernels on these data sets. Due to 
this inadequate performance, we did not continue our 
study with the RBF kernels. 

In addition to these findings, the number of latent 
variables required to reach optimal performance, by L- 
PLS and KPLS-Poly when they are applied to the FFS 
processed data was roughly the same (see Figures 8 and 
9 for 36 months and 60 months respectively). In the 
case of 36 months, the best number of latent variables is 
3 (Figure 8) for both L-PLS and KPLS-Poly models. For 
the 60-month data set (Figure 9), the KPLS-Poly had a 
slightly lower required number, 1, for latent variables 



than the PLS model, which requires 2. As is also shown 
in Table 1, the KPLS-Poly reached the maximum perfor- 
mance at 1 latent variable whereas the L-PLS reached 
maximum at 2 latent variables for the 60-month experi- 
ment. This means that the KPLS-Poly analysis did not 
require as much smoothing of the data to reach its opti- 
mal validation AUC value. This is also indicative that 
the best system generalization was seen between 1 and 
3 latent variables, which is typical to most data analyses 
using these techniques (most are less than 5). 

The analysis of the efficacy of our PLS weight vector- 
based FFS technique in reducing noisy features shows 
that is effective only for the L-PLS method. The results 
can be seen in Table 2. In both the 36-and 60-month 
datasets, top performance was only improved in terms 
of AUC value for the L-PLS. It is to our belief that this 
is due to the fact that we base our FFS of the features 
on their linear combination of the contributions that 
they have to the time of recurrence. We believe that the 
use of a KPLS-based method embedded in the FFS pro- 
cess would capture those features which show non-lin- 
ear contributions to the response variable. The KPLS- 
Poly model was, in both cases, impacted by the removal 
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of some features which must have had a critical role in 
its classification model This was seen more severely in 
the 60-month data set, may be due to the fact that it, 
from the beginning, had half the amount of features 
than the 36-month after CFR. 
SVM Verification of K-PLS polynomial results 
The 36 month KPLS-Poly AUC result of 0.784 was not 
expected when compared with the L-PLS result AUC 
result of 0.791 because these classification problems are 
generally non-linear. We therefore validated this result 
with an independent analysis using SVM using several 
kernels with the exact same data set and cross-validation 
process. Specifically, the data was normalized and for- 
matted for use with LibSVM [36], a widely-used SVM 
implementation. A grid search was implemented to find 
good parameters for each of the built-in kernels, Gaus- 
sian RBF, Sigmoid, and polynomial. A linear SVM was 
not considered as it would not be a good comparison to 
K-PLS. With the grid search including four parameters: 
gamma, coefficient, degree, and C (regularization para- 
meter), the polynomial kernel was found to be the best 
performer. Using 1-hold-out cross-validation, the best 
results found by this method was -0.78 (which agrees 
with the K-PLS polynomial result to within 0.51%), 



though most parameter configurations usually gave an 
output of .63-. 73. No stochastic optimizer was used, so 
it may be possible for slightly higher performance (and 
slightly better agreement) with a exhaustive EP para- 
meter search. Other results in Table 2 above were not 
verified because of the exhaustive analysis performed for 
this 36 month K-PLS polynomial result. 

Kaplan-Meier and Cox 

K-M curves for both PLS using 36 and 60 months can 
be seen in Figures 10 and 11 respectively. Using 36 
months as the cut-off for training, the resulting K-M 
curves for the two categories have a very significant dif- 
ference of p = 4.744e-12. For 60 months, the p-value 
was so low as to only give ~0 as a result of calculations. 
A higher precision computation tool may be capable of 
a more specific result. However, these results make it 
very clear that PLS is easily able to separate patients 
into groups of recurrence sooner and later. The CHR 
between the two categories using the 36-month cut-off 
for training was 2.846341 (2.088547 and 3.879089 for 
95% confidence). For 60-months, it was 3.996732 
(2.828351 and 5.647768 for 95% confidence). These 
numbers show a significantly increased risk of 



Table 2 CFR and FFS Comparison The comparison of model performance on data from the Fine Feature Selection 
process and the Coarse Feature Reduction 



Model 


Top Validation AUC Value CFR-data (36 mo/60 mo) 


Top Validation AUC Value FFS-data (36 mo/60 mo) 


L-PLS 


791/.831 


J94/.869 


KPLS-Poly (Degree = 1) 


.784/. 830 


.780/711 



The FFS process enhanced performance for only the L-PLS while the KPLS-Poly suffered in both the 36- and 60-month data. 



Land et al. BMC Systems Biology 201 1, 5(Suppl 3):S13 
http://www.biomedcentral.eom/1 752-0509/5/S3/S1 3 



Page 15 of 18 





PLS using 36 Month Threshold 


q _ 


Vl 

\t\ 1 

JPir- 

k 1 1 "* — ^ 

C\_ * 
I 1 1 1 'i 


Classification Categories 

■ Good Prognosis Class 

■ Poor Prognosis Class 




CO _ 
O 


!i< - \ ^ 

i'i - 

V \ TL 














obabi 




H 1 in i 




- — 


A i — -. 

nr_ Ti. . ' - 1 












d ~ 






q 

d ~ 








: 


1 1 

50 100 


L 1 

150 200 




Survival Time (Months) p= 4J44e-12 


Figure 10 PLS at 36 Month Threshold. Kaplan-Meier curve of PLS predicted groups using 36-month threshold 



recurrence over time with being in the poor prognosis 
group versus the good prognosis group. These two sta- 
tistics, the K-M curve derived ^-values and CHR, are 
values which can be directly understood by clinicians 
without further training. In other words, with this fra- 
mework, any new patient's data could be sent to us by 
any doctor who reads this article, given a categorization 
by the system as it is currently setup, and then the doc- 
tor can take that knowledge and make decisions on how 
frequent to make checkups and other treatment 
decisions. 

Conclusions 

Our microarray analysis and information extraction 
method comprised three basic components drawing 
from Statistical Learning Theory: 1.) Coarse Feature 
Reduction, 2.) Fine Feature Selection and 3.) 
Classification. 

In Coarse Feature Reduction, the original 22,282 
probes were reduced to 594 for the 3 year cut-off 
(97.5% reduction) and to 212 for the 5 year cut-off 
(99.04% reduction) using basic t-test and variance 



pruning techniques. The Fine Feature Selection was able 
to further reduce the number of features to 361 for the 
60-month and 102 for the 36-month data sets (a further 
reduction of 39.2% and 51.9%). The FFS process has 
been demonstrated to reduce the noise in the data by 
filtering out noisy features from the data set produced 
by the CFR process. By implementing the FFS process 
in our analysis, we were able to enhance the perfor- 
mance of our classifier. 

After utilizing the FFS process, classification compari- 
son is made for the refined data. The optimal classifying 
performance of L-PLS was observed at 3 latent variables 
and 2 latent variables for the 36- and 60-month experi- 
ments, respectively. Similar results were obtained, a 
reduction to 3 and 1 latent variables, when using L-PLS 
on data refined only by CFR. The Area Under the Curve 
(AUC) measure of performance varied from 0.791 to 
0.869, depending upon the particular L-PLS or K-PLS 
and SVM model used (see Tables 1 and 2). PLS results 
for the 36-month cut-off were independently verified 
using Support Vector Machines. In summary, it is 
important to note that by using the SLT techniques, 
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over 22,000 probes were eventually reduced to 3 and 2 
latent variables (for the 36- and 60-month cut-off peri- 
ods, respectively) while still maintaining AUC values in 
the range of 0.79 to 0.86. 

This research also provided a secondary and clinically 
important result, which is that the improved SLT meth- 
ods/paradigms can be integrated into the widely 
accepted and well understood traditional bio-statistical 
Cox Proportional Hazard model and the K-M methods. 
For example, using the SLT paradigms as pre-processors 
for K-M, the resultant probability vs. survival time cate- 
gories have a very significant difference (p = 4.74e-12) 
for the 36-month cut-off and a p ~0 for the 60-month 
cut-off. (Figures 10 and 11, respectively). These results, 
therefore, make it clear that PLS easily and accurately 
separates patients into groups of sooner and later recur- 
rence. Furthermore, the CHR between the two cate- 
gories for the 36-month cut-off was 2.85 (2.09 to 3.88 
for 95% confidence). For the 60-month cut-off the ratio 
was 3.99 (2.83 to 5.65 for 95% confidence). (Figures 10 
and 11, respectively). These results show a significant 
increased risk of recurrence over time when classified as 



being a member of the poor group vs. the good group. 
Consequently, these two results (K-M derived ^-values 
and the CHR), which are directly understood by practi- 
cing clinicians without additional training and were pre- 
processed using the PLS and KPLS algorithms, was 
made possible by the SLT pre-processing we applied in 
this study. 
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